In regression problems where covariates can be naturally grouped, the group Lasso is an attractive
method for variable selection since it respects the grouping structure in the data. We study
the selection and estimation properties of the group Lasso in high-dimensional settings when 
the number of groups exceeds the sample size. We provide sufficient conditions under which the 
group Lasso selects a model whose dimension is comparable with the underlying model with 
high probability and is estimation consistent. However, the group Lasso is, in general, not selec- 
tion consistent and also tends to select groups that are not important in the model. To improve 
the selection results, we propose an adaptive group Lasso method which is a generalization of 
the adaptive Lasso and requires an initial estimator. We show that the adaptive group Lasso is 
consistent in group selection under certain conditions if the group Lasso is used as the initial 
estimator. 
1. Introduction

Consider the linear regression model with p groups of covariates 

$$Y_i = \sum_{k=1}^{p} X_{ik}'\beta_k + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $Y_i$ is the response variable, $\varepsilon_i$ is the error term, $X_{ik}$ is a $d_k \times 1$ covariate vector
representing the $k$th group and $\beta_k$ is the corresponding $d_k \times 1$ vector of regression
coefficients. For such a model, the group Lasso (Antoniadis and Fan (2001), Yuan and 
Lin (2006)) is an attractive method for variable selection since it respects the grouping 
structure in the covariates. This method is a natural extension of the Lasso (Tibshirani 
(1996)), in which an $\ell_2$-norm of the coefficients associated with a group of variables is
used as a component in the penalty function. However, the group Lasso is, in general, 
not selection consistent and tends to select more groups than there are in the model. To 
improve the selection results, we consider an adaptive group Lasso method which is a 
generalization of the adaptive Lasso (Zou (2006)). We provide sufficient conditions under 
which the adaptive group Lasso is selection consistent if the group Lasso is used as the 
initial estimator. 

The need to select groups of variables arises in many statistical modeling problems 
and applications. For example, in multifactor analysis of variance, a factor with multiple 
levels can be represented by a group of dummy variables. In nonparametric additive
regression, each component can be expressed as a linear combination of a set of basis 
functions. In both cases, the selection of important factors or nonparametric components 
amounts to the selection of groups of variables. Several recent papers have considered 
group selection using penalized methods. In addition to the group Lasso, Yuan and Lin 
(2006) have proposed the group Lars and group non-negative garrote methods. Kim, Kim 
and Kim (2006) considered the group Lasso in the context of generalized linear models. 
Zhao, Rocha and Yu (2008) proposed a composite absolute penalty for group selection, 
which can be considered a generalization of the group Lasso. Meier, van de Geer and 
Bühlmann (2008) studied the group Lasso for logistic regression. Huang, Ma, Xie and
Zhang (2008) proposed a group bridge method that can be used for simultaneous group 
and individual variable selection. 

There has been much work on the penalized methods for variable selection and estima- 
tion with high-dimensional data. Several approaches have been proposed, including the 
least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)), the smoothly 
clipped absolute deviation (SCAD) penalty (Fan and Li (2001), Fan and Peng (2004)), 
the elastic net (Enet) penalty (Zou and Hastie (2006)) and the minimum concave penalty 
(Zhang (2007)). Much progress has been made in understanding the statistical properties
of these methods in both fixed $p$ and $p \gg n$ settings. In particular, several recent
studies considered the Lasso with regard to its variable selection, estimation and predic- 
tion properties; see, for example, Knight and Fu (2001), Greenshtein and Ritov (2004), 
Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Huang, Ma and Zhang (2006),
van de Geer (2008) and Zhang and Huang (2008), among others. All of these studies are 
concerned with the Lasso for individual variable selection. 

In this article, we study the asymptotic properties of the group Lasso and the adaptive 
group Lasso in high-dimensional settings when $p \gg n$. We generalize the results concerning
the Lasso obtained in Zhang and Huang (2008) to the group Lasso. We show that,
under a generalized sparsity condition and the sparse Riesz condition, as well as certain
regularity conditions, the group Lasso selects a model whose dimension has the same
order as the underlying model, selects all groups whose $\ell_2$-norms are of greater order
than the bias of the selected model and is estimation consistent. In addition, under a
narrow-sense sparsity condition (see (2.3) in Section 2) and using the group Lasso as the initial
estimator, the adaptive group Lasso can correctly select important groups with high
probability. 

Our theoretical and simulation results suggest the following one-step approach to group 
selection in high-dimensional settings. First, we use the group Lasso to obtain an initial 
estimator and reduce the dimension of the problem. We then use the adaptive group Lasso
to select the final set of groups of variables. Since the computation of the adaptive group 
Lasso estimator can be carried out using the same algorithm and program for the group 
Lasso, the computational cost of this one-step approach is approximately twice that of a 
single group Lasso computation. This approach, iteratively using the group Lasso twice, 
follows the idea of the adaptive Lasso (Zou (2006)) and a proposal by Bühlmann and
Meier (2008) in the context of individual variable selection. 

The rest of the paper is organized as follows. In Section 2, we state the results on the
selection, bias of the selected model and convergence rate of the group Lasso estimator.
In Section 3, we describe the selection and estimation consistency results concerning the
adaptive group Lasso. In Section 4, we use simulation to compare the group Lasso and
adaptive group Lasso. Concluding remarks are given in Section 5. Proofs are given in
Section 6.



2. The asymptotic properties of the group Lasso 

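Let $Y = (Y_1, \ldots, Y_n)'$ and $X = (X_1, \ldots, X_p)$, where $X_k$ is the $n \times d_k$ covariate submatrix
corresponding to the $k$th group. For a given penalty level $\lambda > 0$, the group Lasso estimator
of $\beta = (\beta_1', \ldots, \beta_p')'$ is

$$\hat\beta = \arg\min_{\beta} \frac{1}{2}(Y - X\beta)'(Y - X\beta) + \lambda \sum_{k=1}^{p} \sqrt{d_k}\,\|\beta_k\|_2, \qquad (2.1)$$

where $\hat\beta = (\hat\beta_1', \ldots, \hat\beta_p')'$.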
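As a concrete reading of criterion (2.1), the following sketch evaluates the group Lasso objective and, for the special case of an orthonormal design ($X'X = I$, an assumption made here only so that a closed form is available), computes the minimizer by block soft-thresholding. The function names and the list-based data layout are illustrative choices and do not come from the paper.

```python
import math

def group_norm(v):
    """Euclidean (l2) norm of one coefficient block."""
    return math.sqrt(sum(x * x for x in v))

def group_lasso_objective(X, y, beta_groups, groups, lam):
    """Criterion (2.1): 0.5*||y - X beta||^2 + lam * sum_k sqrt(d_k)*||beta_k||_2.

    X is a list of rows, groups is a list of column-index lists and
    beta_groups[k] holds the coefficients of group k.
    """
    beta = [0.0] * len(X[0])
    for cols, block in zip(groups, beta_groups):
        for j, bj in zip(cols, block):
            beta[j] = bj
    rss = sum((yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
              for row, yi in zip(X, y))
    penalty = sum(math.sqrt(len(b)) * group_norm(b) for b in beta_groups)
    return 0.5 * rss + lam * penalty

def block_soft_threshold(z, threshold):
    """Shrink the block z toward zero; the zero block is returned if ||z||_2 <= threshold."""
    nz = group_norm(z)
    if nz <= threshold:
        return [0.0] * len(z)
    scale = 1.0 - threshold / nz
    return [scale * zi for zi in z]

def group_lasso_orthonormal(X, y, groups, lam):
    """Closed-form minimizer of (2.1); valid only under the assumption X'X = I."""
    solution = []
    for cols in groups:
        z = [sum(row[j] * yi for row, yi in zip(X, y)) for j in cols]  # X_k' y
        solution.append(block_soft_threshold(z, lam * math.sqrt(len(cols))))
    return solution
```

For a general design no closed form exists, and an iterative solver (for example, block coordinate descent or a proximal gradient method) would be needed instead.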

We consider the model selection and estimation properties of p under a generalized 
sparsity condition (GSC) of the model and a sparse Riesz condition (SRC) on the co- 
variate matrix. These two conditions were first formulated in the study of the Lasso 
estimator (Zhang and Huang (2008)). The GSC assumes that for some $\eta_1 \ge 0$, there
exists an $A_0 \subset \{1, \ldots, p\}$ such that $\sum_{k \in A_0} \|\beta_k\|_2 \le \eta_1$, where $\|\cdot\|_2$ denotes the $\ell_2$-norm.
Without loss of generality, let $A_0 = \{q+1, \ldots, p\}$. The GSC is then

$$\sum_{k=q+1}^{p} \|\beta_k\|_2 \le \eta_1. \qquad (2.2)$$

The number of truly important groups is thus q. A more rigid way to describe sparsity 
is to assume $\eta_1 = 0$, that is,

$$\|\beta_k\|_2 = 0, \qquad k = q+1, \ldots, p. \qquad (2.3)$$

This is a special case of the GSC and we call it the narrow-sense sparsity condition 
(NSC). In practice, the GSC is a more realistic formulation of a sparse model. However, 
the NSC can often be considered a reasonable approximation to the GSC, especially 
when $\eta_1$ is smaller than the noise level associated with model fitting.
The SRC controls the range of eigenvalues of the submatrix. For $A \subset \{1, \ldots, p\}$, we
define $X_A = (X_k, k \in A)$ and $\Sigma_{AA} = X_A' X_A / n$. Note that $X_A$ is an $n \times \sum_{k \in A} d_k$ matrix.
The design matrix $X$ satisfies the sparse Riesz condition (SRC) with rank $q^*$ and
spectrum bounds $0 < c_* < c^* < \infty$ if

$$c_* \le \frac{\|X_A \nu\|_2^2}{n\|\nu\|_2^2} \le c^* \qquad \forall A \text{ with } q^* = |A| = \#\{k: k \in A\} \text{ and } \nu \in \mathbb{R}^{\sum_{k \in A} d_k}. \qquad (2.4)$$

Let $\hat A = \{k: \|\hat\beta_k\|_2 > 0, 1 \le k \le p\}$, which is the set of indices of the groups selected by
the group Lasso. An important quantity is the cardinality of $\hat A$, defined as

$$\hat q = |\hat A| = \#\{k: \|\hat\beta_k\|_2 > 0, 1 \le k \le p\}, \qquad (2.5)$$

which determines the dimension of the selected model. If $\hat q = O(q)$, then the selected
model has dimension comparable to the underlying model. Following Zhang and Huang
(2008), we also consider two measures of the selected model. The first measures the error
of the selected model:

$$\tilde\omega = \|(I - \hat P)X\beta\|_2, \qquad (2.6)$$

where $\hat P$ is the projection matrix from $\mathbb{R}^n$ to the linear span of the set of selected groups
and $I \equiv I_{n \times n}$ is the identity matrix. Thus, $\tilde\omega^2$ is the sum of squares of the mean vector
not accounted for by the selected model. To measure the important groups missing in
the selected model, we define

$$\zeta_2 = \Bigl(\sum_{k \notin A_0} \|\beta_k\|_2^2\, I\{\|\hat\beta_k\|_2 = 0\}\Bigr)^{1/2}. \qquad (2.7)$$
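To make the selection summaries in (2.5) and (2.7) concrete, here is a small sketch that computes the selected index set, its cardinality and the norm of the important groups that were missed. The function name, the 0-based group indexing and the argument `important` (the indices of the truly important groups, that is, the complement of $A_0$) are assumptions made for illustration.

```python
import math

def selection_summary(beta_hat_groups, beta_true_groups, important):
    """Return (A_hat, q_hat, zeta_2) for a fitted group-wise coefficient list.

    beta_hat_groups[k] and beta_true_groups[k] are the estimated and true
    coefficient blocks of group k (0-based); `important` is the set of
    indices of the truly important groups.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    # (2.5): groups with a nonzero estimated block are the selected ones.
    a_hat = [k for k, block in enumerate(beta_hat_groups) if norm(block) > 0.0]
    q_hat = len(a_hat)
    # (2.7): l2 norm of the true coefficients of important groups that were not selected.
    missed = [k for k in important if k not in a_hat]
    zeta_2 = math.sqrt(sum(norm(beta_true_groups[k]) ** 2 for k in missed))
    return a_hat, q_hat, zeta_2
```

The quantity $\tilde\omega$ in (2.6) additionally requires a projection onto the selected columns and is not covered by this sketch.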

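We now describe several quantities that will be useful in describing the main results.
Let $d_a = \max_{1 \le k \le p} d_k$, $d_b = \min_{1 \le k \le p} d_k$, $d = d_a/d_b$ and $N_d = \sum_{k=1}^{p} d_k$. Define

$$r_1 \equiv r_1(\lambda) = \Bigl(\frac{n c^* \sqrt{d_a}\,\eta_1}{\lambda d_b q}\Bigr)^{1/2}, \qquad r_2 \equiv r_2(\lambda) = \Bigl(\frac{n c^* \eta_2^2}{\lambda^2 d_b q}\Bigr)^{1/2}, \qquad \bar c = \frac{c^*}{c_*}, \qquad (2.8)$$

where $\eta_2 \equiv \max_{A \subset A_0} \|\sum_{k \in A} X_k \beta_k\|_2$,

$$M_1 \equiv M_1(\lambda) = 2 + 4r_1^2 + 4\sqrt{d\bar c}\,r_2 + 4d\bar c, \qquad (2.9)$$

$$M_2 \equiv M_2(\lambda) = \tfrac{2}{3}\bigl(1 + 4r_1^2 + 2d\bar c + 4\sqrt{2d}\,(1 + \sqrt{\bar c})\sqrt{\bar c}\,r_2 + \tfrac{16}{3}d\bar c^2\bigr), \qquad (2.10)$$

$$M_3 \equiv M_3(\lambda) = \tfrac{2}{3}\bigl(1 + 4r_1^2 + 4\sqrt{d\bar c}\,(1 + 2\sqrt{1 + \bar c})\,r_2 + 3r_2^2 + \tfrac{2}{3}d\bar c(7 + 4\bar c)\bigr). \qquad (2.11)$$

Let $\lambda_{n,p} = 2\sigma\sqrt{8(1 + c_0) d_a d^2 q^* \bar c n c^* \log(N_d \vee a_n)}$, where $c_0 \ge 0$ and $a_n \ge 0$, satisfying
$p d_a/(N_d \vee a_n)^{1+c_0} \approx 0$, and $\lambda_0 = \inf\{\lambda: M_1(\lambda) q + 1 \le q^*\}$, where $\inf \emptyset = \infty$. We also consider
the constraint

$$\lambda \ge \max\{\lambda_0, \lambda_{n,p}\}. \qquad (2.12)$$

For large $p$, the lower bound here is allowed to be $\lambda_{n,p} = 2\sigma[8(1 + c_0) d_a d^2 q^* \bar c n c^* \log(N_d)]^{1/2}$
with $a_n = 0$; for fixed $p$, $a_n \to \infty$ is required.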
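The constants in (2.8)-(2.11) are plain arithmetic in the group sizes, the spectrum bounds and the penalty level, so they can be tabulated directly. The sketch below is one transcription of those formulas; the function name and argument names are hypothetical, and the transcription itself is an assumption that should be checked against (2.8)-(2.11) before use.

```python
import math

def selection_constants(group_sizes, c_lower, c_upper, eta1, eta2, lam, n, q):
    """Evaluate d, N_d, c_bar, r_1, r_2 and M_1, M_2, M_3 as read from (2.8)-(2.11).

    c_lower and c_upper play the roles of the SRC spectrum bounds c_* and c^*.
    Requires q >= 1 and lam > 0.
    """
    d_a, d_b = max(group_sizes), min(group_sizes)
    d = d_a / d_b                      # ratio of largest to smallest group size
    n_d = sum(group_sizes)             # total number of covariates
    c_bar = c_upper / c_lower
    r1 = math.sqrt(n * c_upper * math.sqrt(d_a) * eta1 / (lam * d_b * q))
    r2 = math.sqrt(n * c_upper * eta2 ** 2 / (lam ** 2 * d_b * q))
    m1 = 2 + 4 * r1 ** 2 + 4 * math.sqrt(d * c_bar) * r2 + 4 * d * c_bar
    m2 = (2 / 3) * (1 + 4 * r1 ** 2 + 2 * d * c_bar
                    + 4 * math.sqrt(2 * d) * (1 + math.sqrt(c_bar)) * math.sqrt(c_bar) * r2
                    + (16 / 3) * d * c_bar ** 2)
    m3 = (2 / 3) * (1 + 4 * r1 ** 2
                    + 4 * math.sqrt(d * c_bar) * (1 + 2 * math.sqrt(1 + c_bar)) * r2
                    + 3 * r2 ** 2 + (2 / 3) * d * c_bar * (7 + 4 * c_bar))
    return {"d": d, "N_d": n_d, "c_bar": c_bar, "r1": r1, "r2": r2,
            "M1": m1, "M2": m2, "M3": m3}
```

With $\eta_1 = \eta_2 = 0$ the terms in $r_1$ and $r_2$ vanish and the constants depend only on $d$ and $\bar c$, which gives a quick sanity check on any implementation.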
We assume the following basic condition. 

(C1) The errors $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed as $N(0, \sigma^2)$.

Theorem 2.1. Suppose that $q \ge 1$ and that (C1), the GSC (2.2) and SRC (2.4) are
satisfied. Let $\hat q$, $\tilde\omega$ and $\zeta_2$ be defined as in (2.5), (2.6) and (2.7), respectively, for the model
$\hat A$ selected by the group Lasso from (2.1). Let $M_1$, $M_2$ and $M_3$ be defined as in (2.9), (2.10)
and (2.11), respectively. If the constraint (2.12) is satisfied, then the following assertions
hold with probability converging to 1:

$$\hat q \le \#\{k: \|\hat\beta_k\|_2 > 0 \text{ or } k \notin A_0\} \le M_1(\lambda) q,$$

$$\tilde\omega^2 = \|(I - \hat P)X\beta\|_2^2 \le M_2(\lambda) B_1^2(\lambda),$$

$$\zeta_2^2 = \sum_{k \notin A_0} \|\beta_k\|_2^2\, I\{\|\hat\beta_k\|_2 = 0\} \le \frac{M_3(\lambda) B_1^2(\lambda)}{c_* n},$$

where $B_1(\lambda) = ((\lambda^2 d_b q)/(n c^*))^{1/2}$.

Remark 2.1. The condition $q \ge 1$ is not necessary since it is only used to express quantities
in terms of ratios in (2.8) and Theorem 2.1. If $q = 0$, we use $r_1^2 q = n c^* \sqrt{d_a}\,\eta_1/(\lambda d_b)$
and $r_2^2 q = n c^* \eta_2^2/(\lambda^2 d_b)$ to recover $M_1$, $M_2$ and $M_3$ in (2.9), (2.10), (2.11), respectively,
giving the results $\hat q \le 4 n c^* \sqrt{d_a}\,\eta_1/(\lambda d_b)$, $\tilde\omega^2 \le 8\lambda\sqrt{d_a}\,d_b\,\eta_1/3$ and $\zeta_2^2 = 0$.

Remark 2.2. If $\eta_1 = 0$ in (2.2), then $r_1 = r_2 = 0$ and

$$M_1 = 2 + 4d\bar c, \qquad M_2 = \tfrac{2}{3}\bigl(1 + 2d\bar c + \tfrac{16}{3}d\bar c^2\bigr), \qquad M_3 = \tfrac{2}{3}\bigl(1 + \tfrac{2}{3}d\bar c(7 + 4\bar c)\bigr),$$

all of which depend only on $d$ and $\bar c$. This suggests that the relative sizes of the groups
affect the selection results. Since $d \ge 1$, the most favorable case is $d = 1$, that is, when
the groups have equal sizes.

Remark 2.3. If $d_1 = \cdots = d_p = 1$, the group Lasso simplifies to the Lasso and Theorem 2.1
is a direct generalization of Theorem 1 on the selection properties of the
Lasso obtained by Zhang and Huang (2008). In particular, when $d_1 = \cdots = d_p = 1$,
$r_1$, $r_2$, $M_1$, $M_2$, $M_3$ are the same as the constants in Theorem 1 of Zhang and Huang
(2008).



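Remark 2.4. A more general definition of the group Lasso is

$$\hat\beta = \arg\min_{\beta} \frac{1}{2}(Y - X\beta)'(Y - X\beta) + \lambda \sum_{k=1}^{p} (\beta_k' R_k \beta_k)^{1/2}, \qquad (2.13)$$

where $R_k$ is a $d_k \times d_k$ positive definite matrix. This is useful when certain relationships
among the coefficients need to be specified. By the Cholesky decomposition, there exists
a matrix $Q_k$ such that $R_k = d_k Q_k' Q_k$. Let $\beta_k^* = Q_k\beta_k$ and $X_k^* = X_k Q_k^{-1}$. Then, (2.13)
becomes

$$\hat\beta^* = \arg\min_{\beta^*} \frac{1}{2}\Bigl(Y - \sum_{k=1}^{p} X_k^*\beta_k^*\Bigr)'\Bigl(Y - \sum_{k=1}^{p} X_k^*\beta_k^*\Bigr) + \lambda \sum_{k=1}^{p} \sqrt{d_k}\,\|\beta_k^*\|_2.$$

The GSC for (2.13) is $\sum_{k=q+1}^{p} (\beta_k' Q_k' Q_k \beta_k)^{1/2} \le \eta_1$. The SRC can be assumed for $X \cdot Q^{-1}$,
where $X \cdot Q^{-1} = (X_1 Q_1^{-1}, \ldots, X_p Q_p^{-1})$.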

Immediately, from Theorem 2.1, we have the following corollary. 

Corollary 2.1. Suppose that the conditions of Theorem 2.1 hold and $\lambda$ satisfies the
constraint (2.12). Then, with probability converging to one, all groups with $\|\beta_k\|_2^2 >
M_3(\lambda) q \lambda^2/(c_* c^* n^2)$ are selected.

From Theorem 2.1 and Corollary 2.1, the group Lasso possesses similar properties 
to the Lasso in terms of sparsity and bias (Zhang and Huang (2008)). In particular, 
the group Lasso selects a model whose dimension has the same order as the underlying 
model. Furthermore, all of the groups with coefficients whose $\ell_2$-norms are greater than
the threshold given in Corollary 2.1 are selected with high probability. 

Theorem 2.2. Let $\{\bar c, \sigma, r_1, r_2, c_0, d\}$ be fixed and $1 \le q \le n \le p \to \infty$. Suppose that the
conditions in Theorem 2.1 hold. Then, with probability converging to 1, we have

$$\|\hat\beta - \beta\|_2 \le \frac{1}{\sqrt{n c_*}}\Bigl(2\sigma\sqrt{M_1 \log(N_d) q} + (r_2 + \sqrt{d M_1 \bar c})B_1\Bigr) + \sqrt{\frac{c_* r_1^2 + r_2^2}{c_* c^*}}\,\frac{\sqrt{q}\,\lambda}{n}$$

and

$$\|X\hat\beta - X\beta\|_2 \le 2\sigma\sqrt{M_1 \log(N_d) q} + (2r_2 + \sqrt{d M_1 \bar c})B_1.$$

Theorem 2.2 is stated for a general $\lambda$ that satisfies (2.12). The following result is an
immediate corollary of Theorem 2.2.

Corollary 2.2. Let $\lambda = 2\sigma\sqrt{8(1 + c_0') d_a d^2 q^* \bar c c^* n \log(N_d)}$ with a fixed $c_0' \ge c_0$. Suppose
that all of the conditions in Theorem 2.2 hold. We then have

$$\|\hat\beta - \beta\|_2 = O_p\bigl(\sqrt{q\log(N_d)/n}\bigr) \quad \text{and} \quad \|X\hat\beta - X\beta\|_2 = O_p\bigl(\sqrt{q\log(N_d)}\bigr).$$

This corollary follows by substituting the given $\lambda$ value into the expressions in the
results of Theorem 2.2.


3. Selection consistency of the adaptive group Lasso 

As shown in the previous section, the group Lasso has excellent selection and estimation 
properties. However, there is room for improvement, particularly with regard to selection. 
Although the group Lasso selects a model whose dimension is comparable to that of 
the underlying model, the simulation results reported in Yuan and Lin (2006) and those 
reported below suggest that it tends to select more groups than there are in the underlying 
model. To correct the tendency of overselection by the group Lasso, we generalize the 
idea of the adaptive Lasso (Zou (2006)) for individual variable selection to the present 
problem of group selection. 

Consider a general group Lasso criterion with a weighted penalty term,

$$\frac{1}{2}(Y - X\beta)'(Y - X\beta) + \tilde\lambda \sum_{k=1}^{p} w_k \sqrt{d_k}\,\|\beta_k\|_2, \qquad (3.1)$$

where $w_k$ is the weight associated with the $k$th group. The $\tilde\lambda_k \equiv \tilde\lambda w_k$ can be regarded
as the penalty level corresponding to the $k$th group. For different groups, the penalty
level $\tilde\lambda_k$ can be different. If we can have lower penalty for groups with large coefficients
and higher penalty for groups with small coefficients (in the $\ell_2$ sense), then we expect to
be able to improve variable selection accuracy and reduce estimation bias. One way to
obtain the information about whether a group has large or small coefficients is by using
a consistent initial estimator.

Suppose that an initial estimate $\tilde\beta$ is available. A simple approach to determining the
weight is to use the initial estimator. Consider

$$w_k = w(\tilde\beta_k) = \begin{cases} \|\tilde\beta_k\|_2^{-1}, & \text{if } \|\tilde\beta_k\|_2 > 0, \\ \infty, & \text{if } \|\tilde\beta_k\|_2 = 0. \end{cases} \qquad (3.2)$$

Thus, for each group, its penalty is proportional to the inverse of the norm of $\tilde\beta_k$. This
choice of the penalty level for each group is a natural generalization of the adaptive
Lasso (Zou (2006)). In particular, when each group only contains a single variable, (3.2)
simplifies to the adaptive Lasso penalty.

Let $\theta_a = \max_{k \in A_0^c} \|\beta_k\|_2$ and $\theta_b = \min_{k \in A_0^c} \|\beta_k\|_2$. We say that an initial estimator $\tilde\beta$ is
consistent at zero with rate $r_n$ if $r_n \max_{k \in A_0} \|\tilde\beta_k\|_2 = O_p(1)$, where $r_n \to \infty$ as $n \to \infty$,
and there exists a constant $\xi_b > 0$ such that for any $\varepsilon > 0$, $P(\min_{k \in A_0^c} \|\tilde\beta_k\|_2 > \xi_b\theta_b) > 1 - \varepsilon$
for $n$ sufficiently large.

In addition to (C1), we assume the following conditions:

(C2) the initial estimator $\tilde\beta$ is consistent at zero with rate $r_n \to \infty$;

(C3) $\cdots \to 0$;

(C4) all of the eigenvalues of $\Sigma_{A_0^c A_0^c}$ are bounded away from zero and infinity.
Condition (C2) assumes that an initial zero-consistent estimator exists. It is the most
critical one and is generally difficult to establish. It assumes that we can consistently
differentiate between important and non-important groups. For fixed $p$ and $d_k$, the ordinary
least-squares estimator can be used as the initial estimator. However, when $p > n$, the
least-squares estimator is no longer feasible. By Theorems 2.1 and 2.2, the group Lasso
estimator $\hat\beta$ is consistent at zero with rate $\sqrt{n/(q\log(N_d))}$. Condition (C3) restricts the
numbers of important and non-important groups, as well as variables within the groups.
It also places constraints on the penalty parameter and the $\ell_2$-norm of the smallest
important group. Condition (C4) assumes that the eigenvalues of $\Sigma_{A_0^c A_0^c}$ are finite and
bounded away from zero. This is reasonable since the number of important groups is
small in a sparse model. This condition ensures that the true model is identifiable.

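Define

$$\hat\beta^* = \arg\min_{\beta} \frac{1}{2}(Y - X\beta)'(Y - X\beta) + \tilde\lambda \sum_{k=1}^{p} w_k \sqrt{d_k}\,\|\beta_k\|_2. \qquad (3.3)$$

Theorem 3.1. If (C1)-(C4) and NSC (2.3) are satisfied, then

$$P(\|\hat\beta_k^*\|_2 \ne 0, k \notin A_0, \ \|\hat\beta_k^*\|_2 = 0, k \in A_0) \to 1.$$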

Therefore, the adaptive group Lasso is selection consistent if the conditions stated in 
Theorem 2.1 hold. 

If we use $\hat\beta$ as the initial estimator, then (C3) can be changed to



(C3)* 



y/rfqQogq) \dl /2 q y/dqlogjp - q)log(N d ) 

y/n0 b n&l 



{d a qfl*^M 



0. 



We often have $\tilde\lambda = n^{\alpha}$ for some $0 < \alpha < 1/2$. In this case, the number of non-important
groups can be as large as $\exp(n^{2\alpha}/(q\log q))$ with the number of important groups satisfying
$q^5 \log q/n \to 0$, assuming that $\theta_a$ and the number of variables within the groups are
finite.

Corollary 3.1. Let the initial estimator $\tilde\beta = \hat\beta$, where $\hat\beta$ is the group Lasso estimator.
Suppose that the NSC (2.3) holds and that (C1), (C2), (C3)* and (C4) are satisfied. We
then have

$$P(\|\hat\beta_k^*\|_2 \ne 0, k \notin A_0, \ \|\hat\beta_k^*\|_2 = 0, k \in A_0) \to 1.$$



This corollary follows directly from Theorem 3.1. It shows that the iterated group 
Lasso procedure that uses a combination of the group Lasso and the adaptive group 
Lasso is selection consistent. 
Theorem 3.2. Suppose that the conditions in Theorem 2.2 hold and that $\theta_b > t_b$ for
some constant $t_b > 0$. If $\tilde\lambda \sim O(n^{\alpha})$ for some $0 < \alpha < 1/2$, then

$$\|\hat\beta^* - \beta\|_2 = O_p\Bigl(\sqrt{\frac{q}{n}} + \frac{\tilde\lambda}{n}\Bigr) = O_p\Bigl(\sqrt{\frac{q}{n}}\Bigr), \qquad \|X\hat\beta^* - X\beta\|_2 = O_p\Bigl(\sqrt{q} + \frac{\tilde\lambda}{\sqrt{n}}\Bigr) = O_p(\sqrt{q}).$$

Theorem 3.2 implies that for the adaptive group Lasso, given a zero-consistent initial
estimator, we can reduce a high-dimensional problem to a lower-dimensional one. The
convergence rate is improved, compared with that of the group Lasso, by choosing an
appropriate penalty parameter $\tilde\lambda$.



4. Simulation studies 

In this section, we use simulation to evaluate the finite sample performance of the group 
Lasso and the adaptive group Lasso. Let $\tilde\lambda_k = \tilde\lambda/\|\tilde\beta_k\|_2$ if $\|\tilde\beta_k\|_2 > 0$; if $\|\tilde\beta_k\|_2 = 0$, then
$\tilde\lambda_k = \infty$, $\hat\beta_k^* = 0$. We can thus drop the corresponding covariates $X_k$ from the model and
only consider the groups with $\|\tilde\beta_k\|_2 > 0$. After a scale transformation, we can directly
apply the group least angle regression algorithm (Yuan and Lin (2006)) to compute the
adaptive group Lasso estimator $\hat\beta^*$. The penalty parameters for the group Lasso and the
adaptive group Lasso are selected using the BIC criterion (Schwarz (1978)). 
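The drop-and-rescale step described above can be sketched in a few lines. Groups whose initial estimate is zero receive an infinite penalty and are removed; the columns of each remaining group are multiplied by the norm of its initial estimate, an ordinary group Lasso is solved on the rescaled design, and the solution is scaled back. The sketch below substitutes a simple proximal gradient solver for the group LARS algorithm used in the paper, purely to stay self-contained; all function names and the iteration count are illustrative assumptions.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """Proximal gradient solver for 0.5*||y - X b||^2 + lam*sum_k sqrt(d_k)*||b_k||_2."""
    p = len(X[0])
    # 1/||X||_F^2 is a safe step size because ||X||_F^2 bounds the largest eigenvalue of X'X.
    step = 1.0 / sum(x * x for row in X for x in row)
    b = [0.0] * p
    for _ in range(n_iter):
        resid = [yi - sum(xj * bj for xj, bj in zip(row, b)) for row, yi in zip(X, y)]
        moved = [b[j] + step * sum(row[j] * r for row, r in zip(X, resid))
                 for j in range(p)]
        for cols in groups:                       # block soft-thresholding per group
            z = [moved[j] for j in cols]
            nz = _norm(z)
            thr = step * lam * math.sqrt(len(cols))
            scale = 0.0 if nz <= thr else 1.0 - thr / nz
            for j, zj in zip(cols, z):
                b[j] = scale * zj
    return b

def adaptive_group_lasso(X, y, groups, beta_init, lam_tilde, n_iter=500):
    """Adaptive group Lasso with weights 1/||beta_init_k||_2, computed by column rescaling."""
    norms = [_norm([beta_init[j] for j in cols]) for cols in groups]
    kept = [k for k, nk in enumerate(norms) if nk > 0.0]  # infinite weight: drop the group
    col_map, new_groups = [], []
    for k in kept:
        start = len(col_map)
        col_map.extend((j, norms[k]) for j in groups[k])
        new_groups.append(list(range(start, len(col_map))))
    beta = [0.0] * len(X[0])
    if not kept:
        return beta
    # Rescaled design: the columns of group k are multiplied by ||beta_init_k||_2.
    X_scaled = [[row[j] * s for j, s in col_map] for row in X]
    gamma = group_lasso_ista(X_scaled, y, new_groups, lam_tilde, n_iter)
    for (j, s), g in zip(col_map, gamma):
        beta[j] = s * g                           # undo the scale transformation
    return beta
```

The rescaling works because, with $\gamma_k = \beta_k/\|\tilde\beta_k\|_2$, the weighted penalty $\|\beta_k\|_2/\|\tilde\beta_k\|_2$ becomes the unweighted penalty $\|\gamma_k\|_2$ while the fitted values are unchanged.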

We consider two scenarios of simulation models. In the first scenario, the group sizes 
are equal; in the second, the group sizes vary. For every scenario, we consider the cases 
p <n and p > n. In all of the examples, the sample size is n = 200. 

Example 1. In this example, there are 10 groups, each consisting of 5 covariates. The
covariate vector is $X = (X_1, \ldots, X_{10})$, where $X_j = (X_{5(j-1)+1}, \ldots, X_{5(j-1)+5})$, $1 \le j \le 10$.
To generate $X$, we first simulate 50 random variables, $R_1, \ldots, R_{50}$, independently from
$N(0, 1)$. Then, $Z_j$, $j = 1, \ldots, 10$, are simulated from a multivariate normal distribution
with mean zero and $\mathrm{cov}(Z_{j_1}, Z_{j_2}) = 0.6^{|j_1 - j_2|}$. The covariates $X_1, \ldots, X_{50}$ are generated
as

$$X_{5(j-1)+k} = Z_j + R_{5(j-1)+k}, \qquad 1 \le j \le 10, \ 1 \le k \le 5.$$

The random error $\varepsilon \sim N(0, 3^2)$. The response variable $Y$ is generated from $Y =
\sum_{k=1}^{10} X_k\beta_k + \varepsilon$, where $\beta_1 = (0.5, 1, 1.5, 2, 2.5)$, $\beta_2 = (2, 2, 2, 2, 2)$, $\beta_3 = \cdots = \beta_{10} = (0, 0, 0, 0, 0)$.
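The data-generating recipe of Example 1 can be sketched as follows. The latent variables $Z_j$ are drawn with a first-order autoregressive recursion, which is one convenient way (an implementation choice made here, not stated in the paper) to obtain $\mathrm{cov}(Z_{j_1}, Z_{j_2}) = 0.6^{|j_1 - j_2|}$ with unit variances. The function name, the seed handling and the 0-based column indexing are illustrative.

```python
import math
import random

N_GROUPS = 10
GROUP_SIZE = 5

def simulate_example1(n=200, rho=0.6, sigma=3.0, seed=0):
    """Draw (X, y, groups, beta) following the description of Example 1."""
    rng = random.Random(seed)
    beta = ([0.5, 1.0, 1.5, 2.0, 2.5] + [2.0] * GROUP_SIZE
            + [0.0] * (GROUP_SIZE * (N_GROUPS - 2)))
    X, y = [], []
    for _obs in range(n):
        # AR(1) recursion: unit variances and correlation rho^|j1 - j2|.
        z = [rng.gauss(0.0, 1.0)]
        for _j in range(1, N_GROUPS):
            z.append(rho * z[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        # Each covariate is its group's latent Z_j plus an independent N(0, 1) term R.
        row = [z[j] + rng.gauss(0.0, 1.0)
               for j in range(N_GROUPS) for _k in range(GROUP_SIZE)]
        X.append(row)
        y.append(sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma))
    groups = [list(range(GROUP_SIZE * j, GROUP_SIZE * (j + 1))) for j in range(N_GROUPS)]
    return X, y, groups, beta
```

The other examples differ only in the coefficient vectors, the number of groups and the group sizes, so the same pattern applies with those constants changed.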

Example 2. In this example, the number of groups is $p = 10$. Each group consists of
5 covariates. The covariates are generated the same way as in Example 1. However, the
regression coefficients $\beta_1 = (0.5, 1, 1.5, 1, 0.5)$, $\beta_2 = (1, 1, 1, 1, 1)$, $\beta_3 = (-1, 0, 1, 2, 1.5)$, $\beta_4 =
(-1.5, 1, 0.5, 0.5, 0.5)$, $\beta_5 = \cdots = \beta_{10} = (0, 0, 0, 0, 0)$.
Example 3. In this example, the number of groups $p = 210$ is bigger than the sample
size $n$. Each group consists of 5 covariates. The covariates are generated the same
way as in Example 1. However, the regression coefficients $\beta_1 = (0.5, 1, 1.5, 1, 0.5)$, $\beta_2 =
(1, 1, 1, 1, 1)$, $\beta_3 = (-1, 0, 1, 2, 1.5)$, $\beta_4 = (-1.5, 1, 0.5, 0.5, 0.5)$, $\beta_5 = \cdots = \beta_{210} = (0, 0, 0, 0, 0)$.

Example 4. In this example, the group sizes differ across groups. There are 5 groups
with size 5 and 5 groups with size 3. The covariate vector is $X = (X_1, \ldots, X_{10})$, where
$X_j = (X_{5(j-1)+1}, \ldots, X_{5(j-1)+5})$, $1 \le j \le 5$, and $X_j = (X_{3(j-6)+26}, \ldots, X_{3(j-6)+28})$, $6 \le
j \le 10$. In order to generate $X$, we first simulate 40 random variables $R_1, \ldots, R_{40}$,
independently from $N(0, 1)$. Then, $Z_j$, $j = 1, \ldots, 10$, are simulated with a normal distribution
with mean zero and $\mathrm{cov}(Z_{j_1}, Z_{j_2}) = 0.6^{|j_1 - j_2|}$. The covariates $X_1, \ldots, X_{40}$ are generated
as

$$X_{5(j-1)+k} = Z_j + R_{5(j-1)+k}, \qquad 1 \le j \le 5, \ 1 \le k \le 5,$$

$$X_{3(j-6)+25+k} = Z_j + R_{3(j-6)+25+k}, \qquad 6 \le j \le 10, \ 1 \le k \le 3.$$

The random error $\varepsilon \sim N(0, 3^2)$. The response variable $Y$ is generated from $Y =
\sum_{k=1}^{10} X_k\beta_k + \varepsilon$, where $\beta_1 = (0.5, 1, 1.5, 2, 2.5)$, $\beta_2 = (2, 0, 0, 2, 2)$, $\beta_3 = \cdots = \beta_5 =
(0, 0, 0, 0, 0)$, $\beta_6 = (-1, -2, -3)$, $\beta_7 = \cdots = \beta_{10} = (0, 0, 0)$.



Example 5. In this example, the number of groups is $p = 10$ and the group sizes differ
across groups. The data are generated the same way as in Example 4. However, the
regression coefficients $\beta_1 = (0.5, 1, 1.5, 2, 2.5)$, $\beta_2 = (2, 2, 2, 2, 2)$, $\beta_3 = (-1, 0, 1, 2, 3)$, $\beta_4 =
(-1.5, 2, 0, 0, 0)$, $\beta_5 = (0, 0, 0, 0, 0)$, $\beta_6 = (2, -2, 1)$, $\beta_7 = (0, -3, 1.5)$, $\beta_8 = (-1.5, 1.5, 2)$,
$\beta_9 = (-2, -2, -2)$, $\beta_{10} = (0, 0, 0)$.

Example 6. In this example, the number of groups $p = 210$ and the group sizes differ
across groups. The data are generated the same way as in Example 4. However,
the regression coefficients $\beta_1 = (0.5, 1, 1.5, 2, 2.5)$, $\beta_2 = (2, 2, 2, 2, 2)$, $\beta_3 = (-1, 0, 1, 2, 3)$,
$\beta_4 = (-1.5, 2, 0, 0, 0)$, $\beta_5 = \cdots = \beta_{100} = (0, 0, 0, 0, 0)$, $\beta_{101} = (2, -2, 1)$, $\beta_{102} = (0, -3, 1.5)$,
$\beta_{103} = (-1.5, 1.5, 2)$, $\beta_{104} = (-2, -2, -2)$, $\beta_{105} = \cdots = \beta_{210} = (0, 0, 0)$.

The results are given in Table 1, based on 400 replications. The columns in the table 
include the average number of groups selected with standard error in parentheses, the 
median number ('med') of groups selected with the 25% and 75% quantiles of the number 
of selected groups in parentheses, model error ('ME'), percentage of occasions on which
correct groups are included in the selected model ('% incl') and percentage of occasions on 
which the exactly correct groups are selected ('% sel'), with standard error in parentheses. 
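The selection summaries reported in the table ('mean', '% incl' and '% sel') can be computed from the per-replication selected sets as in the sketch below. The function and key names are illustrative; model error and the standard errors are not covered here.

```python
def selection_rates(selected_sets, true_set):
    """Summarize group selection over replications.

    selected_sets: one collection of selected group indices per replication.
    true_set: indices of the truly important groups.
    Returns the mean number of selected groups, the percentage of replications
    that include all true groups and the percentage that select exactly them.
    """
    truth = set(true_set)
    n_rep = len(selected_sets)
    included = sum(1 for s in selected_sets if truth <= set(s))
    exact = sum(1 for s in selected_sets if set(s) == truth)
    mean_size = sum(len(set(s)) for s in selected_sets) / n_rep
    return {
        "mean": mean_size,
        "pct_incl": 100.0 * included / n_rep,
        "pct_sel": 100.0 * exact / n_rep,
    }
```

Since exact selection implies inclusion, '% sel' can never exceed '% incl', which is a useful consistency check on any reported table.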

Several observations can be made from Table 1. First, in all six examples, the adaptive
group Lasso performs better than the group Lasso in terms of model error and the
percentage of correctly selected models. The group Lasso, which gives the initial estimator
for the adaptive group Lasso, includes the correct groups with high probability. The
improvement is considerable for models with different group sizes. Second, the results
from models with equal group sizes (Examples 1, 2 and 3) are better than those from 
models with different group sizes (Examples 4, 5 and 6). Finally, when the dimension 
of the model increases, the performance of both methods becomes worse. This is to be 
expected since selection in models with a larger number of groups is more difficult. 

5. Concluding remarks 

We have studied the asymptotic selection and estimation properties of the group Lasso 
and adaptive group Lasso in 'large p, small n' linear regression models. For the adaptive 
group Lasso to be selection consistent, the initial estimator should possess two proper- 
ties: (a) it does not miss important groups and variables; (b) it is estimation consistent, 
although it may not be group-selection or variable-selection consistent. Under the condi- 
tions stated in Theorem 2.1, the group Lasso is shown to satisfy these two requirements. 
Thus, the iterated group Lasso procedure, which uses the group Lasso to achieve di- 
mension reduction and generate the initial estimates and then uses the adaptive group 
Lasso to achieve selection consistency, is an appealing approach to group selection in 
high-dimensional settings. 

Table 1. Simulation study by the group Lasso and adaptive group Lasso for Examples 1-6.
The true numbers of groups are included in [] in the first column

            Group Lasso                                  Adaptive group Lasso
σ = 3       mean     med      ME       % incl   % sel    mean     med      ME       % incl   % sel
Ex. 1, [2]  2.04     2        8.79     100      96.5     2.01     2        8.54     100      99.5
            (0.18)   (2,2)    (0.94)   (0)      (0.18)   (0.07)   (2,2)    (0.90)   (0)      (0.07)
Ex. 2, [4]  4.11     4        8.52     99.5     88.5     4.00     4        8.10     99.5     98.00
            (0.34)   (4,4)    (0.94)   (0.07)   (0.32)   (0.14)   (4,4)    (0.87)   (0.07)   (0.14)
Ex. 3, [4]  4.00     4        9.48     93.0     86.5     3.94     4        8.19     93.0     92.5
            (0.38)   (4,4)    (1.19)   (0.26)   (0.34)   (0.27)   (4,4)    (0.96)   (0.26)   (0.26)
Ex. 4, [3]  3.17     3        8.78     100      85.3     3.00     3        8.36     100      100
            (0.45)   (3,3)    (1.00)   (0)      (0.35)   (0)      (3,3)    (0.90)   (0)      (0)
Ex. 5, [8]  8.88     9        7.68     100      40.0     8.03     8        7.58     100      97.5
            (0.81)   (8,10)   (0.94)   (0)      (0.49)   (0.16)   (8,8)    (0.86)   (0)      (0.16)
Ex. 6, [8]  12.90    9        14.61    66.5     7.0      11.49    8        9.28     66.5     47.0
            (12.42)  (8,11)   (7.21)   (0.47)   (0.26)   (12.68)  (7,8)    (5.79)   (0.47)   (0.50)

6. Proofs

We first introduce some notation which will be used in proofs. Let $A_1$ be a set satisfying
$\{k: \|\hat\beta_k\|_2 > 0, k \le p\} \subseteq A_1 \subseteq \{k: X_k'(Y - X\hat\beta) = \lambda\sqrt{d_k}\,\hat\beta_k/\|\hat\beta_k\|_2\} \cup \{1, \ldots, q\}$. Set $A_2 = \{1, \ldots, p\} \setminus A_1$, $A_3 =
A_1 \setminus A_0$, $A_4 = A_1 \cap A_0$, $A_5 = A_2 \setminus A_0$ and $A_6 = A_2 \cap A_0$. Thus, we have $A_1 = A_3 \cup A_4$,
$A_3 \cap A_4 = \emptyset$, $A_2 = A_5 \cup A_6$ and $A_5 \cap A_6 = \emptyset$. Let $|A_i| = \sum_{k \in A_i} d_k$, $N(A_i) = \#\{k: k \in A_i\}$,
$i = 1, \ldots, 6$ and $q_1 = N(A_1)$.

Proof of Theorem 2.1. The basic idea used in this proof follows the proof of the rate
consistency of the Lasso in Zhang and Huang (2008). However, there are many differences
in technical details, for example, in the characterization of the solution via the
Karush-Kuhn-Tucker (KKT) conditions, in the constraint needed for the penalty level and in
the use of maximal inequalities.

The proof consists of three steps. Step 1 proves some inequalities related to $q_1$, $\tilde\omega$
and $\zeta_2$. Step 2 translates the results of Step 1 into upper bounds for $\hat q$, $\tilde\omega$ and $\zeta_2$. Step 3
completes the proof by showing that the probability of the event in Step 2 converges to 1. The
details of the complete proof are available from the website www.stat.uiowa.edu/techrep.
We will sketch the proof in the following.

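If $\hat\beta$ is a solution of (2.1), then, by the KKT condition, $X_k'(Y - X\hat\beta) = \lambda\sqrt{d_k}\,\hat\beta_k/\|\hat\beta_k\|_2$
$\forall \|\hat\beta_k\|_2 > 0$ and $-\lambda\sqrt{d_k} \le X_k'(Y - X\hat\beta) \le \lambda\sqrt{d_k}$ $\forall \|\hat\beta_k\|_2 = 0$. We then have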
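The KKT characterization above can be turned into a numerical check for a candidate solution of (2.1). The sketch below reports the largest violation over all groups. For inactive groups it uses the $\ell_2$-norm form of the subgradient condition, $\|X_k'(Y - X\hat\beta)\|_2 \le \lambda\sqrt{d_k}$, which implies the componentwise bound stated in the text. The function name and tolerance are illustrative assumptions.

```python
import math

def kkt_violation(X, y, beta, groups, lam, tol=1e-8):
    """Largest violation of the group Lasso KKT conditions for criterion (2.1).

    X is a list of rows, beta a flat coefficient list and groups a list of
    column-index lists. A value near zero indicates that beta is a solution.
    """
    resid = [yi - sum(xj * bj for xj, bj in zip(row, beta)) for row, yi in zip(X, y)]
    worst = 0.0
    for cols in groups:
        g = [sum(row[j] * r for row, r in zip(X, resid)) for j in cols]  # X_k'(y - X beta)
        b = [beta[j] for j in cols]
        nb = math.sqrt(sum(v * v for v in b))
        bound = lam * math.sqrt(len(cols))
        if nb > tol:
            # Active group: X_k'(y - X beta) must equal lam*sqrt(d_k)*beta_k/||beta_k||_2.
            viol = max(abs(gj - bound * bj / nb) for gj, bj in zip(g, b))
        else:
            # Inactive group: ||X_k'(y - X beta)||_2 must not exceed lam*sqrt(d_k).
            viol = max(0.0, math.sqrt(sum(v * v for v in g)) - bound)
        worst = max(worst, viol)
    return worst
```

Such a check is useful for validating the output of any iterative solver, since it does not depend on how the candidate solution was obtained.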

Exx 1 S Al /n= [fi Al -$ Al ) +Sr 1 1 S 12/ 3 A2 + ^ 1 1 X' Ai s/n 1 (6.1) 
nS 22 /?A 2 - nE 21 S^ 1 1 Si2^ 2 < C A2 ~ X' A e - E 21 E^S Al + E^S^X^e, (6.2) 

where S At = (S' ki , . . . , S'^ )' , S ki = Xy/d~k~s ki , s k = X' k (Y - X$)/(Xy/d^ , C Ai = (C' ki 
C' k )', C k% = \ y fd k ~I(\\j3 ki || 2 = 0)e dk .xx, all the elements of matrix e dk .xx equal 1, h g A. l 
and = X' A .X Aj /n. 
Step 1. Define 

Vx,=^xi /2 Q' A] iS A] /V^, .7-1,3,4, LJ k = (I-P Al )X A J Ak , fc = 2,...,6, 

where Q Ak j is the matrix representing the selection of variables in A k from Aj . Define 
u = X Al T,^ 1 1 Q' Ail S Ai /n-uj2/\\X Al T 1 ^ 1 1 Q' Ail S Ai /n-uj2\\2- From (6.1) and (6.2), we have 
VU(V 13 + Via) < S' Ai Q Ail ^xi^i2pA 2 + S' Al Q M x^xlX' M e/n + V^XJ2 keAi WPkh and 
IMIl <P' A2 (Cm -X' A2 e-^ 2 x^xiS Al +^21^x1 X' Al s). Then, under GSC, 

ll^ulli + IMIi < (IIV14III + IMIl) 1/a u' e + (||y 14 || 2 + \\PxX A . 2 p A2 \\2) p 2 Wy \ 1/2 

V nc4\Ax\) J 

(6-3) 

+ \/d a Xr]x + X\f d a \\/3 A5 \\ 2 . 

Step 2. Define B\ = X 2 d b q/{nc*{\Ax\)) and B\ = X 2 d b q/ (nc*(\A \ V | Ax |)). In this step, 
we consider the event \u'e\ 2 < (\Ax \ V db)Bx/ '(4qd a )- Suppose that the set Ax contains all 
large /3 k ^ 0. From (6.3), ||V U ||| < B\ + A^X'nx + AVdij 2 B 2 + idB^, so we have 



, nc*{\Ax\)( A , A I X 2 d a q _ , AX 2 d a q 



(qx -q) + <q+ ' ^ d a X m + 4W ™ m + a \ . (6.4) 

X 2 d b \ y nc*(\Ax\) nc*(\Ax\) I 
$\|\omega_2\|_2^2 \le$



3\ 2 



dB\ + Vd(l 



32 



'C 5 )r, 2 B 2 + 2^d am ) + —dC b Bl (6.5) 



From Zhang and Huang (2008), ||w 2 ||| > (||/3 As || 2 (".c*, 5 ) 1/2 - mf and \\X A J A2 1| 2 < 
T]2 + \\Xa 5 Pa 5 \\2 < f]2 + (ncf(\As\))^ 2 \\/3AB lb- By the Cauchy-Schwarz inequality, then, 
we have 



\\Pa b \\lnc* t5 



< 



/ d a q 
y «C*, 5 

r d2 



c*(I4s|) 



1/2 



2772 



X 2 d a q 
Ml^il) 



1/2 



2nc»(|Ai|) 



where c*^ = c*(|^4i U A 5 \). 

Step 3. Letting c*(|A TO |)=c*, c*(\A m \)=c* for AT(vl m )<g*, we have 



3i<JV(AUi4B)<g* 



, / e , 2< (|^i|v4)A 2 4 



4d a nc*(|^i 



(6.6) 



(6.7) 



We have c = C 5 = c*(|A 5 |)/c*(|vli| V \A 5 \) = c*/c« and c*. 5 = c*(|Ai U Asj) = c*. From 
(6.4), (6.5) and (6.6), ( ?1 -?)++<?< M 1<? , ||w 2 ||l < M 2 5 2 , ncH^Hl < Af 3 S 2 when 
(2.12) is satisfied. Define 



max max 

|A|=m||t/Aj|2=l,fc=l,-,m 



j X A (X' A X A )- l S A -(I- P A )Xf3 



\\x A (x' A x A )- l s A -(i- Pa)XP\\ 



(6.8) 



for \A\= qi =m> 0, S A = {S' Al , . . . , S' A J , where S Ak = X^/d^U Ak , \\U Ak \\ 2 = 1- 
Let Q A = X A (X' A X A )~ 1 , where = Xy/d^Xk for ft G A. For a given A, let Vy = 
(0, . . . , 0, 1, 0, . . . , 0) be the \A\ x 1 vector with the jth element in the Ith. group being 
1. Then, by (6.8), 



x*m — max max 

\A\=m l,j 



QaVu 


WQAVuh^^Vdl 


e\I-P A )Xp 


WQAVuh 


\\QaUa\U 


\\(I-P A )Xp\\ 2 



If we define fl mo = {(U,e): x* m < 0-^/8(1 + c )V 2 ((md b ) V d b ) \og(N d V a n ) Vm > m }, 
then (X,e) E O mo |w'e| 2 < «J 2 < (|A X | V 4)A 2 4/(4d a nc*) for JV(Ai) > m > 0. By 
the definition of a;^, it is less than the maximum of (K\ Xfceyt dk normal variables with 
mean and variance o~ 2 V 2 , plus the maximum of (^j normal variables with mean and 
variance a 2 . It follows that P{(X,e) £ OTo } — > 1 when (6.7) holds. This completes the 
sketch of the proof of Theorem 2.1. □ 
Proof of Theorem 2.2. Consider the case when $\{c_*, c^*, r_1, r_2, c_0, \sigma, d\}$ are fixed. The
required configurations in Theorem 2.1 then become

$$M_1 q + 1 \le q^*, \qquad \eta_1 \le \frac{r_1^2}{c^*}\,\frac{q\lambda}{n}, \qquad \eta_2^2 \le \frac{r_2^2}{c^*}\,\frac{q\lambda^2}{n}. \qquad (6.9)$$

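Let $A_1 = \{k: \|\hat\beta_k\|_2 > 0 \text{ or } k \notin A_0\}$. Define $v_1 = X_{A_1}(\hat\beta_{A_1} - \beta_{A_1})$ and $g_1 = X'_{A_1}(Y - X\hat\beta)$.
We then have $\|v_1\|_2^2 \ge c_* n\|\hat\beta_{A_1} - \beta_{A_1}\|_2^2$, $(\hat\beta_{A_1} - \beta_{A_1})'g_1 = v_1'(X\beta - X_{A_1}\beta_{A_1} + \varepsilon) - \|v_1\|_2^2$
and $\|g_1\|_\infty \le \max_{k,\ \|\hat\beta_k\|_2 > 0} \|\lambda\sqrt{d_k}\,\hat\beta_k/\|\hat\beta_k\|_2\|_\infty = \lambda d_a$. Therefore,
$\|v_1\|_2 \le \eta_2 + \|P_{A_1}\varepsilon\|_2 + \lambda\sqrt{d_a N(A_1)/(n c_*)}$. Since $\|P_{A_1}\varepsilon\|_2 \le 2\sigma\sqrt{N(A_1)\log(N_d)}$ with probability
converging to 1 under the normality assumption, $\|X(\hat\beta - \beta)\|_2 \le 2\eta_2 + \|P_{A_1}\varepsilon\|_2 + \lambda\sqrt{d_a N(A_1)/(n c_*)}$.
We then have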

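$$\Big(\sum_{k \in A_1} \|\hat\beta_k - \beta_k\|_2^2\Big)^{1/2} \le \frac{\|v_1\|_2}{\sqrt{n c_*}} \le \frac{1}{\sqrt{n c_*}}\Big(\eta_2 + 2\sigma\sqrt{N(A_1)\log(N_d)} + \sqrt{d M_1 \bar c}\,B_1\Big). \qquad (6.10)$$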

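Since $A_2 \subset A_0$, by the second inequality in (6.9), $\#\{k \in A_0: \|\beta_k\|_2 > \lambda/n\} \le r_1^2 q/c^* \sim O(q)$.
By the SRC and the third inequality in (6.9),
$\sum_{k \in A_0} \|\beta_k\|_2^2 I\{\|\beta_k\|_2 > \lambda/n\} \le \|\sum_{k \in A_0} X_k\beta_k I\{\|\beta_k\|_2 > \lambda/n\}\|_2^2/(n c_*) \le r_2^2 q\lambda^2/(n^2 c_* c^*)$
and $\sum_{k \in A_0} \|\beta_k\|_2^2 I\{\|\beta_k\|_2 \le \lambda/n\} \le r_1^2 q\lambda^2/(c^* n^2)$. From (6.10), we then have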



/3|| 2 < -^(2a^M 1 \og(N d )q + (r 2 + v ^Mii)i? 1 ) + . M +^ 



c*<r 



$$\|X\hat\beta - X\beta\|_2 \le 2\sigma\sqrt{M_1\log(N_d)\,q} + (2r_2 + \sqrt{d M_1 \bar c})B_1.$$
This completes the proof of Theorem 2.2. □ 

Proof of Theorem 3.1. Let u = 0-0, W = X'e/y/ri, V{u) = £? =1 [(e, ~^ u ) 2 + 
J2i=i ^kVdk\\uk + 0kh and = miiLa(£-Xu)'(s-Xu) + YX=i ^kVdk\\u k + 0kh, where 
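$\lambda_k = \lambda/\|\tilde\beta_k\|_2$. By the KKT conditions, if there exists $u$ such that

$$\Sigma_{A_0^c A_0^c}(\sqrt{n}\,u_{A_0^c}) - W_{A_0^c} = -S_{A_0^c}/\sqrt{n}, \qquad \|u_k\|_2 < \|\beta_k\|_2 \ \text{ for } k \in A_0^c, \qquad (6.11)$$
$$-C_{A_0}/\sqrt{n} \le \Sigma_{A_0 A_0^c}(\sqrt{n}\,u_{A_0^c}) - W_{A_0} \le C_{A_0}/\sqrt{n}, \qquad (6.12)$$

then $\|\hat\beta^*_k\|_2 \ne 0$ for $k = 1, \ldots, q$ and $\|\hat\beta^*_k\|_2 = 0$ for $k = q + 1, \ldots, p$.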
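
The role of the KKT conditions can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' procedure. It assumes the objective $\frac{1}{2}\|y - Xb\|_2^2 + \sum_k w_k\|b_k\|_2$, where $w_k$ plays the role of $\lambda_k\sqrt{d_k}$, solves it with a generic proximal gradient loop (a solver chosen here for brevity), and measures how far the solution is from satisfying the group-wise conditions. All function names and data are hypothetical.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Block soft-thresholding: shrink v toward zero by t in Euclidean norm."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def weighted_group_lasso(X, y, groups, weights, n_iter=2000):
    """Proximal gradient for 0.5*||y - X b||^2 + sum_k weights[k]*||b_k||_2."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b + step * X.T @ (y - X @ b)    # gradient step on the squared loss
        for idx, w in zip(groups, weights):
            b[idx] = group_soft_threshold(z[idx], step * w)
    return b

def kkt_violation(X, y, b, groups, weights):
    """Largest violation of the group-wise KKT conditions at b."""
    g = X.T @ (y - X @ b)
    worst = 0.0
    for idx, w in zip(groups, weights):
        nb = np.linalg.norm(b[idx])
        if nb > 0:
            # Selected group: X_k'(y - Xb) must equal w * b_k / ||b_k||_2.
            viol = np.linalg.norm(g[idx] - w * b[idx] / nb)
        else:
            # Unselected group: ||X_k'(y - Xb)||_2 must not exceed w.
            viol = max(0.0, np.linalg.norm(g[idx]) - w)
        worst = max(worst, viol)
    return worst

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 3
    groups = [np.arange(k * d, (k + 1) * d) for k in range(4)]
    X = rng.normal(size=(n, 4 * d))
    beta = np.concatenate([np.full(d, 2.0), np.full(d, -1.5), np.zeros(2 * d)])
    y = X @ beta + rng.normal(size=n)
    weights = [30.0 * np.sqrt(d)] * 4       # hypothetical lambda_k * sqrt(d_k)
    b_hat = weighted_group_lasso(X, y, groups, weights)
    print("KKT violation:", kkt_violation(X, y, b_hat, groups, weights))
    print("group norms:", [float(np.linalg.norm(b_hat[g])) for g in groups])
```

The equality for selected groups and the inequality for unselected groups mirror the two-part structure of (6.11) and (6.12): the proof exhibits a candidate that satisfies both, which forces the minimizer to have the stated zero pattern.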

From (6.11) and (6.12), (VmUg) -^ A l A cW A c = -^^ A cSa% and T, AoA o(^H,u A c) - 
W Ao =-n- 1 / 2 X Ao (I-P A oJe-n- 1 ^ AoA c^jl A oS A o. Define the events 

E x = {n- l l^- A \ AC X' A «s) k \\ 2 < M\0kh - n- 1/2 KX4 A cS A e) k \\ 2 , k G A c }, 

E 2 = {n-/ 2 \\(X' A(> (I - P A *)s) k \\ 2 < n-^WCkh - n^ 1 ^ 2 \\ (T, AoA ^'E A l A cS A ^) k \\ 2 , k G A Q }, 
where $(\cdot)_k$ denotes the $d_k$-dimensional subvector of the vector $(\cdot)$ corresponding to the
$k$th group. We then have $P(\|\hat\beta^*_k\|_2 \ne 0,\ k \notin A_0, \text{ and } \|\hat\beta^*_k\|_2 = 0,\ k \in A_0) \ge P(E_1 \cap E_2)$ and
$P(E_1 \cap E_2) = 1 - P(E_1^c \cup E_2^c) \ge 1 - P(E_1^c) - P(E_2^c)$.

First, we consider $P(E_1^c)$. Define $R = \{\|\tilde\beta_k\|_2^{-1} \le c_1\theta_b^{-1},\ k \notin A_0\}$, where $c_1$ is a constant.
$P(E_1^c) = P(E_1^c \cap R) + P(E_1^c \cap R^c) \le P(E_1^c \cap R) + P(R^c)$. By (C2), $P(R^c) \to 0$.
Let N q = Efc=i ^fc) r i — ' ' ' — rv, be the eigenvalues of Ha^a^ and 71, . . . ,7at, be the 
associated eigenvectors. The jth element in the Ith group of vector S^cSa^ is uij = 

E^=i ^h'l'SA-hij- By the Cauchy-Schwarz inequality, uf } < rf 2 J^fji IMIlH^gll 2 = 
rf 2 N q \\S A o\\l < rf 2 iV g (£* =1 W- Therefore, |M| 2 < d fc rf Yd^Ac^ 1 ) 2 - 

If we define t^c = ^6> 6 - n _1 / 2 ciTf 1 gd^ /2 A^ 1 , r)A% = 7i" 1/2 S^c 1 Ac ^c£, 6l = 
n-Va x X^ o (7 - P^e, C Ag = {max feA§ || % || 2 > UA g}, then P(Pf) < P(CUg)! By Lem- 
mas 1 and 2 of Huang, Ma and Zhang (2008), P{C c Aq ) < K(d a logq) 1 / 2 /v A °, where K is 
a constant, and $K(d_a \log q)^{1/2}/v_{A_0^c} \to 0$ from (C3). We then have $P(E_1^c \cap R) \to 0$ and $P(E_1^c) \to 0$.

Next, we consider P(P|). Similarly as above, define D = {H/SfcH^ 1 > r n ,k £ Ao} H P. 
P(P 2 C ) < P(P 2 C nfl)l P(D C ). By (C2), P(P C ) 0. IE&Er=i(^ok(^A § )«^| < 
Ei=*i < rf Y^Adflf \ where is the Zth element of vector E A c SUg • If wc de- 

fine v Ao = ?i _1/2 Ar„V4 - n~ 1/2 rf Vda /2 Aci(9f \ CU = {max ke A Ukh > V A }, then 
P(Q c )<P(C Ao ), P{C Ao )<K{d a \og{p-q)f' 2 Jv Ao . K{d a \og(p-q))V*/v Ao -> from 
(C3). We then have $P(E_2^c \cap D) \to 0$ and $P(E_2^c) \to 0$. This completes the proof of Theorem
3.1. □



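Proof of Theorem 3.2. If we let $\hat A = \{k: \|\tilde\beta_k\|_2 > 0,\ k = 1, \ldots, p\}$, then $\sum_{k \notin \hat A} \|\hat\beta^*_k\|_2 = 0$,
the dimension of our problem (3.1) is reduced to $\hat q$, with $\hat q \le q^*$ and $\hat A^c \subset A_0$. By the
definition of $\hat\beta^*$, we have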



keA 



= ~ x J2i^(\\M 2 -\\^h)<xJ2 



keA 



\ 11/3 



k 2 



■11/3^-AI 



(6.14) 



fc||2 



If wc let ^ = ^l\0* A - (3 A ) and D = ^ A fx' A , then ||y - XJ* A \\ 2 /2 - \\Y - 
X A p A \\ 2 /2 = S' A S A /2 - (De)'5 A . By (6.13) and (6.14), S' A 8 A /2 - (De)'^ - rf < 0, so 
11^4 --Dell! - li-Delli- 27 ?* <0. By the triangle inequality, H^Hz < - P)e|| 2 + ||-De|| 2 . 
Thus, ||^||i<6||^ £ || 2 + 6 ?/ *. 

Let Di be the ith column of D. E(\\De\\%) = a 2 tr(D'D) = a 2 q. Then, with probability 
converging to 1, \\p A - p A \\ 2 < 6a 2 M l9 /(nc*) + (A^/(6^^)) 2 /2 + \\p A - f3 A \\ 2 /2. 
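
The trace identity used above follows from a one-line computation. The derivation below is a sketch that assumes only that $\varepsilon$ has mean zero and covariance $\sigma^2 I$; it does not rely on the specific form of $D$.

```latex
\begin{aligned}
E\|D\varepsilon\|_2^2
  &= E\,\mathrm{tr}(\varepsilon' D' D \varepsilon)
   = E\,\mathrm{tr}(D' D\, \varepsilon \varepsilon') \\
  &= \mathrm{tr}\bigl(D' D\, E[\varepsilon \varepsilon']\bigr)
   = \sigma^2\, \mathrm{tr}(D' D).
\end{aligned}
```

The first equality writes a scalar as its own trace, the second uses the cyclic property of the trace, and the third exchanges expectation and trace by linearity. The value of $\mathrm{tr}(D'D)$ then depends on the particular $D$ defined in the proof.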






Thus, for $\lambda = n^{\alpha}$ for some $0 < \alpha < 1/2$, with probability converging to 1,

and $\|X_{\hat A}\hat\beta^*_{\hat A} - X_{\hat A}\beta_{\hat A}\|_2 \le \sqrt{n c^*}\,\|\hat\beta^*_{\hat A} - \beta_{\hat A}\|_2 \sim O_p(\sqrt{q})$. This completes the proof of Theorem
3.2. □